Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems
Authors
Abstract
For parallel breadth-first search (BFS) on large-scale distributed memory systems, communication often costs significantly more than arithmetic and limits the scalability of the algorithm. In this paper, we substantially reduce the communication cost of distributed BFS by compressing and sieving the messages. First, we leverage a bitmap compression algorithm to reduce the size of messages before communication. Second, we propose a novel distributed directory algorithm, cross directory, to sieve the redundant data out of messages. Experiments on a 6,144-core SMP cluster show that our algorithm outperforms the baseline implementation in Graph500 by 2.2 times, reduces its communication time by 79.0%, and achieves a performance rate of 12.1 GTEPS (billion edge visits per second).
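The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not name the bitmap compression algorithm or describe how cross directory works, so the code below uses a plain uncompressed bitmap and a simple per-destination "already sent" set as stand-ins. The function names `sieve`, `to_bitmap`, and `from_bitmap` are hypothetical.

```python
def sieve(frontier, already_sent):
    """Drop vertex IDs already sent to this destination (stand-in for a sieve).

    `already_sent` is a set kept by the sender for one destination; it is an
    assumption of this sketch, not the paper's cross directory structure.
    """
    fresh = []
    for v in frontier:
        if v not in already_sent:
            already_sent.add(v)
            fresh.append(v)
    return fresh


def to_bitmap(vertices, num_vertices):
    """Encode a set of vertex IDs as one bit per vertex (plain bitmap)."""
    bitmap = bytearray((num_vertices + 7) // 8)
    for v in vertices:
        bitmap[v >> 3] |= 1 << (v & 7)
    return bytes(bitmap)


def from_bitmap(bitmap):
    """Decode a bitmap back into a sorted list of vertex IDs."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bitmap)
        for bit in range(8)
        if (byte >> bit) & 1
    ]


if __name__ == "__main__":
    sent = set()
    message = sieve([1, 3, 3, 9], sent)   # duplicates are dropped
    payload = to_bitmap(message, 16)      # 2 bytes for a 16-vertex range
    print(message, len(payload), from_bitmap(payload))
```

A bitmap of this kind costs a fixed `num_vertices / 8` bytes per message, so it only beats a list of fixed-width vertex IDs when the frontier is reasonably dense; a real compression scheme would also shrink the long runs of zero bits in sparse frontiers.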
Similar resources
Graph partitioning for scalable distributed graph computations
Inter-node communication time constitutes a significant fraction of the execution time of graph algorithms on distributed-memory systems. Global computations on large-scale sparse graphs with skewed degree distributions are particularly challenging to optimize for, as prior work shows that it is difficult to obtain balanced partitions with low edge cuts for these graphs. In this work, we attemp...
Using Tadpoles to Reduce
A parallel variant of breadth-first search for distributed computing is presented. The variant allows exhaustive enumeration of elements of a search space (implicitly defined graph) in which the representation of all graph nodes would otherwise require more than the total available memory. This algorithm requires the use of a tadpole data structure to partition the search space into connected subg...
Optimizing Communication by Compression for Multi-GPU Scalable Breadth-First Searches
The Breadth First Search (BFS) algorithm is the foundation and building block of many higher-level graph-based operations such as spanning trees, shortest paths and betweenness centrality. The importance of this algorithm increases each day because it is a key requirement for many data structures that are becoming popular nowadays. When the BFS algorithm is parallelized by distributing the graph betw...
Distributed Graph Layout for Scalable Small-world Network Analysis
The in-memory graph layout or organization has a considerable impact on the time and energy efficiency of distributed memory graph computations. It affects memory locality, inter-task load balance, communication time, and overall memory utilization. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, an...
Parallelizing Irregular Applications through the YAPPA Compilation Framework
Modern High Performance Computing (HPC) clusters are composed of hundreds of nodes integrating multicore processors with advanced cache hierarchies. These systems can reach several petaflops of peak performance, but are optimized for floating point intensive applications, and regular, localizable data structures. The network interconnection of these systems is optimized for bulk, synchronous tra...
Journal: CoRR
Volume: abs/1208.5542
Publication date: 2012